stochastic gradient descent algorithm
A stochastic gradient descent algorithm with random search directions
Firstly, we introduce some notations. Secondly, we shall formulate our new class of SGD algorithms with random search directions which includes the SCGD algorithm. Finally, we will spell out some regularity assumptions. The SCGD algorithm as given in (3), represents the practical definition by considering the vectors in the canonical basis of R d . However, we can extend this coordinate selecting rule by using more general random vectors with a possible adaptive sampling policy. Therefore, we introduce the Stochastic Coordinate Gradient Descent algorithm with Random Search Direction (SCORS) defined for all n 1, by X n +1= X n γ nD (V n +1) f U n+1( X n), (SCORS) where the initial state X 1 is a squared integrable random vector of R d which can be arbitrarily chosen, D (v) = vv T for any vector v R d, the sequence (U n) is independent from (X n) where ( U n) is independent and identically distributed with U ([ [1, N ] ])distribution and V n is a random vector of R d sampled from an underlying distribution P n satisfying certain conditions (see Assumption 1 below). Furthermore, we 3 assume that V n +1is independent from U n +1conditionally on F n, where F n = σ (X 1, . . .
Reinforcement Learning under Model Mismatch
Aurko Roy, Huan Xu, Sebastian Pokutta
We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs of [1, 15, 11] to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.
On the convergence of loss and uncertainty-based active learning algorithms
Haimovich, Daniel, Karamshuk, Dima, Linder, Fridolin, Tax, Niek, Vojnovic, Milan
We study convergence rates of loss and uncertainty-based active learning algorithms under various assumptions. First, we provide a set of conditions under which a convergence rate guarantee holds, and use this for linear classifiers and linearly separable datasets to show convergence rate guarantees for loss-based sampling and different loss functions. Second, we provide a framework that allows us to derive convergence rate bounds for loss-based sampling by deploying known convergence rate bounds for stochastic gradient descent algorithms. Third, and last, we propose an active learning algorithm that combines sampling of points and stochastic Polyak's step size. We show a condition on the sampling that ensures a convergence rate guarantee for this algorithm for smooth convex loss functions. Our numerical results demonstrate efficiency of our proposed algorithm.
Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics
Vu-Quoc, Loc, Humer, Alexander
Three recent breakthroughs due to AI in arts and science serve as motivation: An award winning digital image, protein folding, fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solid, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) relied on Long-Short-Term Memory (LSTM) architecture, with method (3) relying on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural network (PINN) methods, which could be combined with attention mechanism to address discontinuous solutions. Both LSTM and attention architectures, together with modern and generalized classic optimizers to include stochasticity for DL networks, are extensively reviewed. Kernel machines, including Gaussian processes, are provided to sufficient depth for more advanced works such as shallow networks with infinite width. Not only addressing experts, readers are assumed familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming at bringing first-time learners quickly to the forefront of research. History and limitations of AI are recounted and discussed, with particular attention at pointing out misstatements or misconceptions of the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.
Parallelized Stochastic Gradient Descent
With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique --- contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime.
Dive Into Deep Learning -- Part 2. This is part 2 of my summary of the…
The naive approach: Take the derivative of the loss function which is an average of the losses calculated on every example in the dataset, a full update is powerful but it has some drawbacks… Drawbacks: . Can be extremely slow as we need to pass over the entire dataset to make a single update. . If there is a lot of redundancy in the training data, the benefit of a full update is very low The extreme approach Consider only a single example at a time and update steps based on one observation at a time, does that remind you of something?? Yes, it's the stochastic gradient descent algorithm or SGD. It can be effective even in large datasets but it also has some drawbacks… Drawbacks: . It can take longer to process one sample at a time compared to a full batch .
Better Methods and Theory for Federated Learning: Compression, Client Selection and Heterogeneity
Federated learning (FL) is an emerging machine learning paradigm involving multiple clients, e.g., mobile phone devices, with an incentive to collaborate in solving a machine learning problem coordinated by a central server. FL was proposed in 2016 by Kone\v{c}n\'{y} et al. and McMahan et al. as a viable privacy-preserving alternative to traditional centralized machine learning since, by construction, the training data points are decentralized and never transferred by the clients to a central server. Therefore, to a certain degree, FL mitigates the privacy risks associated with centralized data collection. Unfortunately, optimization for FL faces several specific issues that centralized optimization usually does not need to handle. In this thesis, we identify several of these challenges and propose new methods and algorithms to address them, with the ultimate goal of enabling practical FL solutions supported with mathematically rigorous guarantees.
SignSGD: Fault-Tolerance to Blind and Byzantine Adversaries
Akoun, Jason, Meyer, Sebastien
Distributed learning has become a necessity for training ever-growing models by sharing calculation among several devices. However, some of the devices can be faulty, deliberately or not, preventing the proper convergence. As a matter of fact, the baseline distributed SGD algorithm does not converge in the presence of one Byzantine adversary. In this article we focus on the more robust SignSGD algorithm derived from SGD. We provide an upper bound for the convergence rate of SignSGD proving that this new version is robust to Byzantine adversaries. We implemented SignSGD along with Byzantine strategies attempting to crush the learning process. Therefore, we provide empirical observations from our experiments to support our theory. Our code is available on GitHub https://github.com/jasonakoun/signsgd-fault-tolerance and our experiments are reproducible by using the provided parameters.
Parallelized Stochastic Gradient Descent
Zinkevich, Martin, Weimer, Markus, Li, Lihong, Smola, Alex J.
With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique --- contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime.